124 research outputs found
Interpreting Deep Visual Representations via Network Dissection
The success of recent deep convolutional neural networks (CNNs) depends on
learning hidden representations that can summarize the important factors of
variation behind the data. However, CNNs are often criticized as being black boxes
that lack interpretability, since they have millions of unexplained model
parameters. In this work, we describe Network Dissection, a method that
interprets networks by providing labels for the units of their deep visual
representations. The proposed method quantifies the interpretability of CNN
representations by evaluating the alignment between individual hidden units and
a set of visual semantic concepts. By identifying the best alignments, units
are given human interpretable labels across a range of objects, parts, scenes,
textures, materials, and colors. The method reveals that deep representations
are more transparent and interpretable than expected: we find that
representations are significantly more interpretable than they would be under a
random equivalently powerful basis. We apply the method to interpret and
compare the latent representations of various network architectures trained to
solve different supervised and self-supervised training tasks. We then examine
factors affecting network interpretability, such as the number of
training iterations, regularizations, different initializations, and the
network depth and width. Finally, we show that the interpreted units can be used
to provide explicit explanations of a prediction given by a CNN for an image.
Our results highlight that interpretability is an important property of deep
neural networks that provides new insights into their hierarchical structure.
Comment: *B. Zhou and D. Bau contributed equally to this work. 15 pages, 27
figures
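The abstract above describes scoring the alignment between a hidden unit and a visual concept, then labeling the unit with its best-aligned concept. A minimal sketch of that idea follows. It assumes the alignment score is intersection-over-union (IoU) between a thresholded activation map and a binary concept mask, which the abstract itself does not spell out. The function names `unit_concept_iou` and `best_label`, the `quantile` parameter, and the single-image setup with masks at activation resolution are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def unit_concept_iou(activation, concept_mask, quantile=0.995):
    """IoU between a thresholded activation map and a binary concept mask.

    Assumption: the unit is binarized by keeping activations above a high
    quantile of its own values (here computed on one map for simplicity).
    """
    threshold = np.quantile(activation, quantile)
    unit_mask = activation > threshold
    concept_mask = concept_mask.astype(bool)
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return float(intersection) / float(union) if union else 0.0


def best_label(activation, concept_masks, quantile=0.995):
    """Return the (concept name, IoU) pair with the highest alignment."""
    scores = {
        name: unit_concept_iou(activation, mask, quantile)
        for name, mask in concept_masks.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


if __name__ == "__main__":
    # Toy 4x4 activation map that fires only in the top-left corner.
    activation = np.zeros((4, 4))
    activation[:2, :2] = 1.0

    # Hypothetical concept masks at the same resolution.
    top_left = np.zeros((4, 4), dtype=bool)
    top_left[:2, :2] = True
    bottom_right = np.zeros((4, 4), dtype=bool)
    bottom_right[2:, 2:] = True

    masks = {"top_left": top_left, "bottom_right": bottom_right}
    print(best_label(activation, masks, quantile=0.5))
```

In practice one would likely aggregate the intersection and union counts over a whole annotated dataset rather than a single image, and upsample activation maps to the annotation resolution before comparing.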
Street-View Image Generation from a Bird's-Eye View Layout
Bird's-Eye View (BEV) Perception has received increasing attention in recent
years as it provides a concise and unified spatial representation across views
and benefits a diverse set of downstream driving applications. While the focus
has been placed on discriminative tasks such as BEV segmentation, the dual
generative task of creating street-view images from a BEV layout has rarely
been explored. The ability to generate realistic street-view images that align
with a given HD map and traffic layout is critical for visualizing complex
traffic scenarios and developing robust perception models for autonomous
driving. In this paper, we propose BEVGen, a conditional generative model that
synthesizes a set of realistic and spatially consistent surrounding images that
match the BEV layout of a traffic scenario. BEVGen incorporates a novel
cross-view transformation and spatial attention design which learn the
relationship between cameras and map views to ensure their consistency. Our
model can accurately render road and lane lines, as well as generate traffic
scenes under different weather conditions and times of day. The code will be
made publicly available.